1. Background

Here we are concerned with correcting for allometric relationships in morphometric analyses. Many traits are correlated with body size, for example, and studying variation in such metrics must account for the fact that part of the differences we see are due to differences in some underlying quantity (e.g.Ā body size) through allometry. For example, do male lizards have wider heads than females just because they are bigger? Or is there something more that body size does not account for?

Many ways have been proposed to correct for body size, such as dividing by body size, taking the residuals of a linear regression against body size, or principal component analysis. All of these methods suffer from problems, as outlined by Chan and Grismer (2022), who recently proposed to resurrect a ā€œproperā€ allometric correction already shows as superior in the 80s, and which is based on a logarithmic regression.

However, this method, together with all the previous ones, suffers from some important caveats that become apparent when studying (and comparing) multiple groups (1) that differ in their allometric relationship with the horizontal axis and/or (2) that do not occupy the same range along the horizontal axis. Here we argue that these two problems cause all of the proposed corrections to be misleading in the context of group comparison (e.g.Ā studies of sexual dimorphism where the groups are males and females), pretty much for the same reasons.

We will cover both in order with examples. In what follows, we will use male-female differences as our main working example.

# But first we setup R
rm(list = ls())

library(tidyverse)

theme_set(theme_classic())

# Load useful custom functions
source("functions.R")

set.seed(42) # for reproducibility

2. The no-problem case

Let us start with an example where there is no problem. The allometric relationship is the same between the sexes, and both males and females cover the entire range of body sizes, or variable x. We simulate data for males and females using some accessory function we have created (whose source code the reader can check in functions.R but that is not the point here).

# Let us first simulate some data
data <- simulate_data(intercept_f = 0.1, intercept_m = 10, slope_f = 0.5, slope_m = 0.5, sd_y = 1)

# Check them out
data
## # A tibble: 2,000 Ɨ 3
##    sex        x     y
##    <chr>  <dbl> <dbl>
##  1 female  31.9  18.4
##  2 male    31.9  26.5
##  3 female  22.2  12.2
##  4 male    22.2  21.5
##  5 female  26.8  12.5
##  6 male    26.8  22.8
##  7 female  28.2  14.3
##  8 male    28.2  21.2
##  9 female  27.0  12.8
## 10 male    27.0  24.3
## # ℹ 1,990 more rows

If we plot that…

# Plot
plot_data(data)

As you can see we have a variable x, which will serve as our horizontal axis, and variable y that is related to x and differs between the sexes. (The straight lines are linear regressions for each sex.) This case is not a problem in the sense that if we draw a single straight line through the whole dataset and compute the deviation of male and female y to that line, we essentially remove the effect of x. If we perform a global regression through the entire data…

# Here is the global straight line
plot_data(data, midline = TRUE)

And then look at the deviations between y and that straight line…

# Plot the corrected values
plot_data(data, midline = TRUE, correct = TRUE)

We see that the relationship with x has been successfully removed but there still are male-female differences. That is our sexual dimorphism that is not accounted for by body size! Hence, in that case using residuals of a global regression analysis is fine to remove the effect of body size differences between the sexes.

3. Differences in scaling

Things get more complicated when the relationship (irrespective of intercept) is not the same between the groups. This could be e.g.Ā a non-linear, allometric relationship, where y increases faster with body size in males than in females. Let us simulate some non-linear data.

# Simulate non-linear relationship this time
data_nl <- simulate_data(intercept_f = 0.1, intercept_m = 10, slope_f = 0.5, slope_m = 0.5, sd_y = 1, fun = function(x) exp(x / 7))

# Note: the `fun` argument allows us to transform otherwise linear data.

Here, we transformed the previously linear data into some exponential-looking data.

# Plot the allometric data
plot_data(data_nl, midline = TRUE, linear = FALSE)

# Note: setting `linear` to FALSE makes sure we fit an allometric model and not a linear one

The lines are now fitted using an allometric model, which has been regarded as the best correction for allometry even for linear relationships (see Chan and Grismer, 2022). Indeed, those lines go pretty well through both sexes, but fitting a global regression and taking deviations from it gives us the following:

# Plot the corrected allometric data
plot_data(data_nl, midline = TRUE, correct = TRUE, linear = FALSE)

Which is less than ideal as it did not remove the relationship with body size x. The reason for that is that as x increases, the gap between the sexes becomes wider and therefore, so does their respective distance from the global regression. The allometry has not been corrected away because the trend with x is different between the sexes.

Transforming the data so that the relationship looks similar between the sexes therefore sounds like a good solution to that problem. Since the relationship is allometric, what does log-transforming do, for example?

# Plot on a log-scale
plot_data(data_nl, midline = TRUE, linear = FALSE, transform = log, inverse = exp)

# Note: we have to provide the `inverse` of the re-scaling function.

Now the relationship with x seems to be approximately identical in both sexes, which is what we want, because it means that if we now correct for x, then the relationship with x should be gone and only a male-female offset remain.

# Correct the transformed data
plot_data(data_nl, midline = TRUE, linear = FALSE, correct = TRUE, transform = log, inverse = exp)

Tadaaa! Actually, we could do even better, as we see that the relationship now looks linear but our re-scaled prediction does not seem to capture exactly that relationship (it is curved). What if we re-fit a regression after transforming the data?

# Re-fit after transforming
plot_data(data_nl, midline = TRUE, linear = FALSE, transform = log, inverse = exp, refit = TRUE)

The regressions seem to fit better. What happens when we correct for x?

# Correct
plot_data(data_nl, midline = TRUE, linear = FALSE, transform = log, inverse = exp, refit = TRUE, correct = TRUE)

We can also choose to fit a linear regression instead of an allometric one the second time around, since the data look quite linear. This would give us:

# Correct
plot_data(data_nl, midline = TRUE, linear = FALSE, transform = log, inverse = exp, refit = TRUE, linear_refit = TRUE, correct = TRUE)

# Note: this was done with `linear_refit`.

Pretty god damn perfect.

Note that here we were lucky in the sense that the log-transformation we picked did the job in making the two sex-specific lines parallel to each other, such that when we corrected x away we were only left with an offset. But in practice, we may need to tinker around more to find the right transformation. For example, the function log(x) / (1 + log(x)) has a stronger effect than log(x) in reducing high values, in case the latter did not do the job. Also note that this transformation need not make the relationship linear: what matters is that the trend is the same in the two sexes, so that we can measure an offset that is more-or-less the same along the x axis.

3. Heterogeneity along the horizontal axis

This brings us to the second problem, that of unequal spread of the data along the horizontal axis. This is a common feature in biological data sets, especially when it comes to sex differences. For example, males may have wider heads than females, but they may also be bigger than females. So, how much the dimorphism in head shape can be explained simply by an allometric relationship with body size? Consider the following data set:

# Simulate heterogeneity along the horizontal axis
data_hg <- simulate_data(intercept_f = 0.1, intercept_m = 10, slope_f = 0.2, slope_m = 0.2, sd_y = 1, homogeneity = 0.4)

# Plot heterogeneous data
plot_data(data_hg)

# Note: `homogeneity` shrinks one sex to the left and the other to the right
# when smaller than one

For the sake of the example we have simulated linear relationships with identical slopes in both sexes. What happens if we fit a global regression through both sexes?

# Fit a global regression
plot_data(data_hg, midline = TRUE)

As you may have expected, the global regression passes through both sexes, but is very much not parallel to the sex-specific lines, which represent the true scaling relationship between x and y. Because of heterogeneity of the groups along the horizontal axis, the effect of x that the global regression aims to capture is grossly over-estimated.

This means that when we correct for this ā€œwrongā€ relationship with x, we get:

# Plot the corrected data
plot_data(data_hg, midline = TRUE, correct = TRUE)

Which does not remove the scaling with x at all.

And so, when both sexes covered the same range on the horizontal axis, a global regression managed to capture the ā€œtrueā€ relationship between x and y. That is no longer the case when both sexes do not overlap along x.

To solve that problem, we must use something else than a global regression to estimate this ā€œtrueā€ relationship. We clearly see this relationship when we look at each sex separately, so what if we found a middle ground between these two?

# With middle ground
plot_data(data_hg, midline = TRUE, separate = TRUE)

This looks much better, and if we now correct for x:

# Correct
plot_data(data_hg, midline = TRUE, separate = TRUE, correct = TRUE)

The dependency on x is gone and what is left is the sexual dimorphism between males and females!

4. Combining problems

Let us now combine both problems into one, very problematic data set where not only are the (non-linear) relationships of y with x different in both sexes but in addition, both sexes are unequally distributed along the horizontal axis.

# Simulate problematic data
data_pb <- simulate_data(intercept_f = 0.1, intercept_m = 0.5, slope_f = 0.5, slope_m = 2, fun = function(x) exp(x / 30), homogeneity = 0.7)

# Plot
plot_data(data_pb, linear = FALSE)

What is the sex difference that is not attributable to differences in body size in this data set where the sexes differ not only in body size but also in how y scales with body size?

Well, we can combine the techniques we have seen above.

First, we have heterogeneous data, so we know that we should use a mid-line between the two separate regressions rather than a global regression.

# Plot with mid-line
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE)

But the relationship is different in males and females so the correction will not work…

# Correct
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, correct = TRUE)

This suggests we must transform the scale of the data. What about a log-transformation?

# Log-transform
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = log, inverse = exp)

This is better, but not great, as the relationship with x is still present in the corrected data:

# Correct
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = log, inverse = exp, correct = TRUE)

A transformation by the function log(x) / (1 + log(x)) seems more appropriate.

# New transformation
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)))

# Note: exp(y / (1 - y)) is the inverse of log(x) / (1 + log(x)).

However, the corrected data still does not look optimal:

# Correct
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)), correct = TRUE)

But that is because we have not re-fitted the model after transforming the data. What if we do that?

# Re-fit after transforming
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)), refit = TRUE)
## Warning in new_fun(transform(y)): NaNs produced

## Warning in new_fun(transform(y)): NaNs produced

This looks much better. And again, the relationship looks linear on this new scale so we might as well include that when we re-fit the model.

# Linear re--fitting
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)), refit = TRUE, linear_refit = TRUE)

If we now correct for x:

# Correct
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)), refit = TRUE, linear_refit = TRUE, correct = TRUE)

Tadaaa!

5. Future avenues

More than two groups? Make this a package? Or a paper?

6. Specifications

This report was generated with the following specifications:

# Show session information
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Amsterdam
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.4    
##  [5] purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
##  [9] ggplot2_3.5.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5     jsonlite_1.8.4   highr_0.10       compiler_4.3.3  
##  [5] tidyselect_1.2.1 jquerylib_0.1.4  scales_1.3.0     yaml_2.3.7      
##  [9] fastmap_1.1.1    R6_2.5.1         labeling_0.4.3   generics_0.1.3  
## [13] knitr_1.42       munsell_0.5.1    bslib_0.4.2      pillar_1.9.0    
## [17] tzdb_0.3.0       rlang_1.1.3      utf8_1.2.4       stringi_1.7.12  
## [21] cachem_1.0.7     xfun_0.39        sass_0.4.5       timechange_0.2.0
## [25] cli_3.6.2        withr_3.0.0      magrittr_2.0.3   digest_0.6.36   
## [29] grid_4.3.3       rstudioapi_0.14  hms_1.1.3        lifecycle_1.0.4 
## [33] vctrs_0.6.5      evaluate_0.20    glue_1.7.0       farver_2.1.2    
## [37] fansi_1.0.6      colorspace_2.1-0 rmarkdown_2.21   tools_4.3.3     
## [41] pkgconfig_2.0.3  htmltools_0.5.5